Occam factors and model-independent Bayesian learning of continuous distributions
Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the data and the phase space factors arising from the integration over the model space determine the free parameter of the theory ("smoothness scale") self-consistently. This persists even for distributions that are atypical in the prior and is a step towards a model-independent theory for learning continuous distributions. Finally, we point out that a wrong parameterization of a model family may sometimes be advantageous for small data sets.
I. INTRODUCTION

One of the central problems in learning is to balance "goodness of fit" criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space, the "Occam factor", that discriminates against models with more parameters [...] or, more specifically, against models which are more complex in a precise information theoretic sense [...]. This works remarkably well for systems with a finite number of parameters and creates a complexity "razor" (after "Occam's razor") that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [...]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [...]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density.

A possible route to progress on the nonparametric problem was opened by noticing [...] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [...] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [...] and address some of the questions left open by the previous work [...]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [...]? What happens if our learning problem is strongly atypical of the prior distribution?

The conclusions we finally reach were not expected by us at the start of the project, and they will probably not be intuitively obvious to most of our readers either. Thus we chose to present this work in the same way it originally proceeded. First we develop a numerical scheme for implementing the learning algorithm of Ref. [...]. Then we show some results of Monte Carlo simulations with this algorithm and notice some peculiar features that were not predicted by the previous literature. Concurrently with the simulations, we present a simple analytical argument that explains these unexpected but extremely desirable features.



II. PRELIMINARIES 

Following Ref. [...], if $N$ independent, identically distributed samples $\{x_i\}$, $i = 1 \ldots N$, are observed, then the probability that a particular density $Q(x)$ gave rise to these data is given by

$$\mathcal{P}[Q(x)|\{x_i\}] = \frac{\mathcal{P}[Q(x)] \prod_{i=1}^{N} Q(x_i)}{\int [dQ(x)]\, \mathcal{P}[Q(x)] \prod_{i=1}^{N} Q(x_i)}, \qquad (1)$$

where $\mathcal{P}[Q(x)]$ encodes our a priori expectations of $Q$. Specifying this prior on a space of functions defines a QFT, and the optimal least square estimator is then the a posteriori Bayesian average

$$Q_{\rm est}(x|\{x_i\}) = \frac{\langle Q(x) Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}{\langle Q(x_1) Q(x_2) \cdots Q(x_N) \rangle^{(0)}}, \qquad (2)$$

where $\langle \cdots \rangle^{(0)}$ means averaging with respect to the prior. Since $Q(x) > 0$, it is convenient to define an unconstrained field $\phi(x)$, $Q(x) = (1/\ell_0) \exp[-\phi(x)]$, where the choice of the dimension setting constant $\ell_0$ must not influence any final results. Other definitions are also possible [...], but we think that most of our results do not depend on this choice.

Next we should select a prior that regularizes the infinite number of degrees of freedom and allows learning. We want the prior $\mathcal{P}[\phi]$ to make sense as a continuous theory, independent of discretization of $x$ on small scales. Since it is not clear what a renormalization procedure for a probability density would mean, we also require that when we estimate the distribution $Q(x)$ the answer must be everywhere finite. These conditions imply that our field theory must be ultraviolet (UV) convergent. For $x$ in one dimension, a minimal choice is

$$\mathcal{P}[\phi(x)] = \frac{1}{Z} \exp\left[ -\frac{\ell^{2\eta-1}}{2} \int dx \left( \frac{\partial^{\eta} \phi}{\partial x^{\eta}} \right)^2 \right] \delta\left[ \frac{1}{\ell_0} \int dx\, e^{-\phi(x)} - 1 \right], \qquad (3)$$

where $\eta > 1/2$, $Z$ is the normalization constant, and the $\delta$-function enforces normalization of $Q$. We refer to $\ell$ and $\eta$ as the smoothness scale and the exponent, respectively; they would be called hyperparameters in other machine learning literature [...].

In Ref. [...] this theory was solved for large $N$ and $\eta = 1$ using the familiar WKB techniques. The saddle point (or the classical solution) for the $\phi$ averaging in $\langle \prod_{i=1}^{N} Q(x_i) \rangle^{(0)}$ was found to be given by

$$\ell\, \partial_x^2 \phi_{\rm cl}(x) + \frac{N}{\ell_0} \exp[-\phi_{\rm cl}(x)] = \sum_{j=1}^{N} \delta(x - x_j), \qquad (4)$$

and the fluctuation determinant around this saddle is



$$R = \exp\left[ -\frac{1}{2} \left( \frac{N}{\ell \ell_0} \right)^{1/2} \int dx\, \exp[-\phi_{\rm cl}(x)/2] \right]. \qquad (5)$$



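Then the correlation functions take a familiar form:

$$\left\langle \prod_{i=1}^{N} Q(x_i) \right\rangle^{(0)} \approx \frac{1}{\ell_0^N} \exp\left( -S_{\rm eff}[\phi_{\rm cl}(x); \{x_i\}] \right), \qquad (6)$$

$$S_{\rm eff} = \frac{\ell}{2} \int dx\, (\partial_x \phi_{\rm cl})^2 + \sum_{j=1}^{N} \phi_{\rm cl}(x_j) - \log R. \qquad (7)$$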

In Ref. [...] it was shown that, with such correlation functions, Eq. (2) is a "proper" solution to the learning problem: it is nonsingular even at finite $N$, it converges to the target distribution $P(x)$ that actually generates the data, and the variance of fluctuations around the target, $\psi(x) = -\log Q_{\rm est}(x) - [-\log \ell_0 P(x)]$, falls off rather quickly as $\sim 1/\sqrt{\ell N P(x)}$. It was also noted that the effective action [Eq. (7)] has acquired a term $-\log R$, which grows as $\ell$ decreases. This is contrary to the data contribution, $\sum_{j=1}^{N} \phi_{\rm cl}(x_j)$, which favors small $\ell$ and the corresponding overfitting. Thus the $-\log R$ term may be rightfully called an infinite dimensional generalization of the Occam factors. The authors speculated that, if the actual $\ell$ is unknown, one may average over it and hope that, much as in Bayesian model selection [...], the competition between the data and the fluctuations will select the optimal smoothness scale $\ell^*$. Finally, they suggested that this optimal scale might behave as $\ell^* \sim N^{1/3}$.
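
The $N^{1/3}$ suggestion can be motivated by a rough balance of terms. What follows is only a sketch under stated assumptions: the data term is taken to be negligible, the kinetic term is taken to grow linearly in $\ell$ with an $N$-independent coefficient, and $-\log R$ is taken to scale as $(N/\ell)^{1/2}$. The constants $c_1$ and $c_2$ are hypothetical placeholders and do not appear in the original argument.

```latex
S_{\rm eff}(\ell) \approx c_1\,\ell + c_2\,(N/\ell)^{1/2},
\qquad
\frac{\partial S_{\rm eff}}{\partial \ell}\bigg|_{\ell^*}
  = c_1 - \frac{c_2}{2}\,N^{1/2}\,(\ell^*)^{-3/2} = 0
\;\;\Longrightarrow\;\;
\ell^* = \left(\frac{c_2}{2 c_1}\right)^{2/3} N^{1/3}.
```

Under these assumptions the minimum moves to larger $\ell$ as $N$ grows, because the Occam term penalizes small $\ell$ more and more strongly.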

Before we proceed to the numerical implementation of the above algorithm, a note is in order. At first glance the theory we study seems to look almost exactly like a Gaussian Process [...]. This impression is produced by a Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (4), and violates many familiar results that hold for Gaussian Processes, the representer theorem [...] being just one of them. In the semi-classical limit of large $N$, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it depends on the data only through the classical solution; this is remarkably different from the usual determinants arising in the Gaussian Processes literature.



III. THE ALGORITHM 

Numerical implementation of the theory is rather simple. First, to eliminate a possible infrared singularity of the theory [...], we confine $x$ to a box $0 \le x \le L$ with periodic boundary conditions. The boundary value problem Eq. (4) is then solved by a standard "relaxation" (or Newton) method of iterative improvements to a guessed solution [...] (the same target precision is used in all runs).
The independent variable $x \in [0, 1]$ is discretized in equal steps [$10^4$ for Figs. (1-4), and $10^5$ for Figs. (5, 6)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of $\phi_{\rm cl}(x)$ is [...]

$$\delta x \sim \sqrt{\ell / N P(x)}, \qquad (8)$$

which can be rather small for large $N$ or small $\ell$.
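
The relaxation step described above can be illustrated with a small Newton solver. This is a minimal sketch and not the implementation used for the figures: it assumes the saddle-point equation has the form $\ell\,\phi'' + (N/\ell_0)\,e^{-\phi} = \sum_j \delta(x - x_j)$ for $\eta = 1$, replaces the delta functions by histogram counts on the grid, uses a dense matrix, and adds a backtracking safeguard on the discretized action. The function name `solve_classical` and all of its parameters are hypothetical.

```python
import numpy as np

def solve_classical(samples, ell, L=1.0, ell0=1.0, M=256, tol=1e-8, max_iter=100):
    """Newton relaxation for phi_cl on a periodic grid of M points (eta = 1)."""
    samples = np.asarray(samples, dtype=float)
    N = samples.size
    h = L / M
    # Histogram counts stand in for sum_j delta(x - x_j) (an approximation).
    idx = np.floor((samples % L) / h).astype(int) % M
    counts = np.bincount(idx, minlength=M).astype(float)
    # Periodic second-difference operator, written densely for clarity.
    I = np.eye(M)
    D2 = (np.roll(I, 1, axis=0) + np.roll(I, -1, axis=0) - 2.0 * I) / h**2

    def action(phi):
        # Discretized convex action whose stationary point is the grid equation.
        dphi = (np.roll(phi, -1) - phi) / h
        with np.errstate(over="ignore"):
            bulk = 0.5 * ell * dphi**2 + (N / ell0) * np.exp(-phi)
        return h * np.sum(bulk) + counts @ phi

    phi = np.full(M, np.log(L / ell0))          # start from the uniform density
    for _ in range(max_iter):
        e = (N / ell0) * np.exp(-phi)
        F = ell * (D2 @ phi) + e - counts / h   # residual of the grid equation
        J = ell * D2 - np.diag(e)               # Jacobian of the residual
        step = -np.linalg.solve(J, F)
        # Backtracking: halve the step until the action does not increase.
        t, s0 = 1.0, action(phi)
        while t > 1e-8 and not action(phi + t * step) <= s0 + 1e-12 * abs(s0):
            t *= 0.5
        phi = phi + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    x = np.linspace(0.0, L, M, endpoint=False)
    return x, phi
```

The estimate is then `Q = np.exp(-phi) / ell0`. The dense solve inside each iteration is the "inversion of a large matrix" that makes fine grids expensive; a sparse or banded solver would be the natural replacement on grids as fine as Eq. (8) requires.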

Since the theory is UV convergent, we can generate random probability densities chosen from the prior Eq. (3) by replacing $\phi$ with its Fourier series and truncating the latter at some sufficiently high wavenumber $k_c$ [$k_c = 1000$ for Figs. (1-4), and 5000 for Figs. (5, 6)]. Then Eq. (3) enforces the amplitude of the $k$'th mode ($k > 0$) to be distributed a priori normally around zero with the standard deviation

$$\sigma_k \propto \ell^{-(\eta - 1/2)}\, k^{-\eta}. \qquad (9)$$

Once all these amplitudes are selected, the $k = 0$ harmonic is then set by the normalization condition.
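
As an illustration of this sampling procedure, the sketch below draws one random density from a truncated Fourier series. It is not the authors' code. It assumes a prior whose exponent is $-(\ell^{2\eta-1}/2)\int dx\,(\partial_x^{\eta}\phi)^2$ and a real cosine/sine expansion on $[0, L)$, for which each amplitude has standard deviation $[2/(L\,\ell^{2\eta-1})]^{1/2}(L/2\pi k)^{\eta}$; other Fourier conventions change the constant prefactor. The name `sample_prior_density` and the defaults for `k_c` and `M` are hypothetical and much smaller than the values quoted in the text.

```python
import numpy as np

def sample_prior_density(eta, ell, L=1.0, k_c=200, M=2048, rng=None):
    """Draw a random density Q(x) on a periodic grid from a truncated Fourier series."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(0.0, L, M, endpoint=False)
    k = np.arange(1, k_c + 1)
    # Standard deviation of each cosine and sine amplitude (assumed convention).
    sigma = np.sqrt(2.0 / (L * ell ** (2 * eta - 1))) * (L / (2 * np.pi * k)) ** eta
    a = rng.normal(0.0, sigma)                  # cosine amplitudes, k = 1..k_c
    b = rng.normal(0.0, sigma)                  # sine amplitudes,   k = 1..k_c
    phase = 2 * np.pi * np.outer(k, x) / L
    phi = a @ np.cos(phase) + b @ np.sin(phase)
    # Rescaling Q is equivalent to fixing the k = 0 mode by normalization.
    Q = np.exp(-phi)
    Q /= Q.sum() * (L / M)
    return x, Q
```

Smaller `eta` or smaller `ell` gives visibly rougher densities, since more weight is carried by high wavenumbers.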

Coded in such a way, the simulations are extremely computationally intensive because each iteration step involves an inversion of a large matrix. Therefore, Monte Carlo averagings given here are only over 500 runs, fluctuation determinants are calculated according to Eq. (5), not using numerical path integration, and $Q_{\rm cl} = (1/\ell_0) \exp[-\phi_{\rm cl}]$ is always used as an approximation to $Q_{\rm est}$.

FIG. 1. $Q_{\rm cl}$ found for different $N$ at $\ell = 0.2$. [Curves: fits for 10, 1000, and 100000 samples, and the actual distribution; horizontal axis: $x$.]



IV. SIMULATIONS: CORRECT PRIOR 

As an example of the algorithm's performance, Fig. (1) shows one particular learning run for $\eta = 1$ and $\ell = 0.2$. We see that singularities and overfitting are absent even for $N$ as low as 10. Moreover, the approach of $Q_{\rm cl}(x)$ to the actual distribution $P(x)$ is remarkably fast: for $N = 10$, they are similar; for $N = 1000$, very close; for $N = 100000$, one needs to look carefully to see the difference between the two.

To quantify this similarity of distributions, we compute the Kullback-Leibler divergence $D_{\rm KL}(P \| Q_{\rm est})$ between the true distribution $P(x)$ and its estimate $Q_{\rm est}(x)$, and then average over the realizations of the data points and the true distribution. As discussed in [...], this learning curve $\Lambda(N)$ measures the (average) excess cost incurred in coding the $N+1$'st data point because of the finiteness of the data sample, and thus can be called the "universal learning curve". If the inference algorithm uses all of the information contained in the data that is relevant for learning ("predictive information" [...]), then [...]

$$\Lambda(N) \sim (L/\ell)^{1/2\eta}\, N^{1/2\eta - 1}. \qquad (10)$$

We test this prediction against the learning curves in the actual simulations. For $\eta = 1$ and $\ell = 0.4, 0.2, 0.05$, these are shown on Fig. (2). One sees that the exponents are extremely close to the expected $1/2$, and the ratios of the prefactors are within the errors from the predicted scaling $\sim 1/\sqrt{\ell}$. All of this means that the proposed algorithm for finding densities not only works, but is at most a constant factor away from being optimal in using the predictive information of the sample set.
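
The comparison with Eq. (10) amounts to computing a KL divergence on the grid and fitting a power law to the resulting learning curve. The following is a hedged sketch of both steps, assuming densities tabulated on an equally spaced grid with spacing `dx`; the helper names `kl_divergence` and `fit_power_law` are hypothetical.

```python
import numpy as np

def kl_divergence(P, Q, dx):
    """Riemann-sum estimate of D_KL(P||Q) for densities tabulated on a grid."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    mask = P > 0                                # terms with P = 0 contribute nothing
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])) * dx)

def fit_power_law(N, Lam):
    """Least-squares fit of Lam ~ prefactor * N**slope in log-log coordinates."""
    slope, intercept = np.polyfit(np.log(N), np.log(Lam), 1)
    return slope, float(np.exp(intercept))
```

For $\eta = 1$ the fitted slope would be compared with $-1/2$, and the ratio of prefactors at two values of $\ell$ with the ratio of the corresponding $1/\sqrt{\ell}$.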
V. SIMULATIONS: WRONG PRIOR

Next we investigate how one's choice of the prior influences learning. We first stress that there is no such thing as a wrong prior. If one admits a possibility of it being wrong, then it does not encode all of the a priori knowledge! It does make sense, however, to ask what happens if the distribution we are trying to learn is an extreme outlier in the prior $\mathcal{P}[\phi]$. One way to generate such an example is to choose a typical function from a different prior $\mathcal{P}'[\phi]$, and this is what we mean by "learning with a wrong prior." If the prior is wrong in this sense, and learning is described by the equations of Sec. II, then we still expect the asymptotic behavior, Eq. (10), to hold; only the prefactors of $\Lambda$ should change, and those must increase since there is an obvious advantage in having the right prior; we illustrate this in Figs. (3, 4).
For Fig. (3), both $\mathcal{P}'[\phi]$ and $\mathcal{P}[\phi]$ are given by Eq. (3), but $\mathcal{P}'$ has the "actual" smoothness scale $\ell_a = 0.4, 0.05$, and for $\mathcal{P}$ the "learning" smoothness scale is $\ell = 0.2$ (we show the case $\ell_a = \ell = 0.2$ again as a reference). The $\Lambda \sim 1/\sqrt{N}$ behavior is seen unmistakably. The prefactors are a bit larger (unfortunately, insignificantly) than the corresponding ones from Fig. (2), so we may expect that the "right" $\ell$, indeed, provides better learning (see later for a detailed discussion).

Further, Fig. (4) illustrates learning when not only $\ell$, but also $\eta$ is "wrong" in the sense defined above. We illustrate this for $\eta_a = 2, 0.8, 0.6, 0$ (remember that only $\eta_a > 0.5$ removes UV divergences). Again, the inverse square root decay of $\Lambda$ should be observed, and this is evident for $\eta_a = 2$. The $\eta_a = 0.8, 0.6, 0$ cases are different: even at the largest $N$ simulated the estimate of the distribution is far from the target, thus the asymptotic regime is not reached. This is a crucial observation for our subsequent analysis of the smoothness scale determination from the data. Remarkably, $\Lambda$ (both averaged and in the single runs shown) is monotonic, so even in the cases of qualitatively less smooth distributions there still is no overfitting. On the other hand, $\Lambda$ is well above the asymptote for $\eta_a = 2$ and small $N$, which means that initially too many details are expected and wrongfully introduced into the estimate, but then they are almost immediately ($N \sim 300$) eliminated by the data.



VI. SMOOTHNESS SCALE SELECTION 

Following the argument suggested in [...], we now view $\mathcal{P}[\phi]$, Eq. (3), as being a part of some wider model that involves a prior over $\ell$. The details of the prior are irrelevant, however, if $S_{\rm eff}(\ell)$, Eq. (7), has a minimum that steepens as $N$ grows. We explicitly note that this mechanism is not tuning of the prior's parameters, but Bayesian inference at work: $\ell^*$ emerges in a competition between the kinetic, the data, and the Occam terms to make $S_{\rm eff}$ smaller, and thus the total probability of the data is larger. In its turn, larger probability means, roughly speaking, a shorter total code length, hence the relation to the MDL paradigm [...].

The data term, on average, is equal to $N D_{\rm KL}(P \| Q_{\rm cl})$, and, for very regular $P(x)$ (an implicit assumption in [...]), it is small. Thus only the kinetic and the Occam terms matter, and $\ell^* \sim N^{1/3}$ [...]. For less regular distributions $P(x)$, this is not true [cf. Fig. (4)]. For $\eta = 1$, $Q_{\rm cl}(x)$ approximates large-scale features of $P(x)$ very well, but details at scales smaller than $\sim \sqrt{\ell L/N}$ [cf. Eq. (8) with $P \sim 1/L$] are averaged out. If $P(x)$ is taken from the prior, Eq. (3), with some $\eta_a$, then these details fall off with the wave number $k$ as $\sim k^{-\eta_a}$. Thus the data term is $\sim N^{1.5 - \eta_a} \ell^{\eta_a - 0.5}$ and is not necessarily small. For $\eta_a < 1.5$ this dominates the kinetic term and competes with the fluctuations to set

$$\ell^* \sim N^{(\eta_a - 1)/\eta_a}, \qquad \eta_a < 1.5. \qquad (11)$$

There are two remarkable things about Eq. (11). First, for $\eta_a = 1$, $\ell^*$ stabilizes at some constant value, which we expect to be equal to $\ell_a$. Second, even for $\eta \ne \eta_a$, Eqs. (10, 11) ensure that $\Lambda$ scales as $\sim N^{1/2\eta_a - 1}$, which is at worst a constant factor away from the best scaling, Eq. (10), achievable with the "right" prior, $\eta = \eta_a$. So, by allowing $\ell^*$ to vary with $N$ we can correctly capture the structure of models that are qualitatively different from our expectations ($\eta \ne \eta_a$) and produce estimates of $Q$ that are extremely robust to the choice of the prior. To our knowledge, this feature has not been noted before in a reference to a nonparametric problem.
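
The scaling of the optimal smoothness scale can be checked numerically on a toy version of the effective action. The sketch below is an illustration only: it keeps just a data term $\propto N^{1.5-\eta_a}\ell^{\eta_a-0.5}$ and an Occam term $\propto (N/\ell)^{1/2}$, with hypothetical prefactors `c_data` and `c_occam` set to one, and minimizes their sum on a logarithmic grid of $\ell$. It is valid only for $\eta_a > 0.5$, where the toy action has a minimum.

```python
import numpy as np

def optimal_scale(N, eta_a, c_data=1.0, c_occam=1.0):
    """Minimize a toy effective action over ell (hypothetical O(1) prefactors)."""
    ell = np.logspace(-6, 3, 20001)             # candidate smoothness scales
    s_eff = (c_data * N ** (1.5 - eta_a) * ell ** (eta_a - 0.5)
             + c_occam * np.sqrt(N / ell))
    return ell[np.argmin(s_eff)]

def scaling_exponent(eta_a, Ns=(1e2, 1e3, 1e4, 1e5, 1e6)):
    """Fit the exponent a in ell* ~ N**a from the toy minimization."""
    ell_star = [optimal_scale(N, eta_a) for N in Ns]
    slope, _ = np.polyfit(np.log(Ns), np.log(ell_star), 1)
    return slope
```

In this toy model the fitted exponent follows $(\eta_a - 1)/\eta_a$: it is negative for $\eta_a < 1$, zero for $\eta_a = 1$, and positive for $\eta_a > 1$.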

We present simulations relevant to these predictions in Figs. (5, 6). Unlike on the previous Figures, the results are not averaged due to extreme computational costs, so all our further claims have to be taken cautiously. On the other hand, selecting $\ell^*$ in single runs has some practical advantages: we are able to ensure the best possible learning for any realization of the data. Fig. (5) shows single learning runs for various $\eta_a$ and $\ell_a$. In addition, to keep the Figure readable, we do not show runs with $\eta_a = 0.6, 0.7, 1.2, 1.5, 3$, and $\eta_a \to \infty$, which is a finitely parameterizable distribution. All of these display a good agreement with the predicted scalings: Eq. (11) for $\eta_a < 1.5$, and $\ell^* \sim N^{1/3}$ otherwise. Next we calculate the KL divergence between the target and the estimate at $\ell = \ell^*$; the average of this divergence over the samples and the prior is the learning curve [cf. Eq. (10)]. For $\eta_a = 0.8, 2$ we plot the divergences on Fig. (6) side by side with their fixed $\ell = 0.2$ analogues. Again, the predictions clearly are fulfilled. Note that for $\eta_a \ne \eta$ there is a qualitative advantage in using the data induced smoothness scale.
VII. PARAMETERIZATION AS A WRONG PRIOR

The last four Figures have illustrated some aspects of learning with "wrong" priors. However, all of our results may be considered as belonging to the "wrong prior" class. Indeed, the actual probability distributions we used were not nonparametric continuous functions with smoothness constraints, but were composed of $k_c$ Fourier modes, thus had $2k_c$ parameters. For finite parameterization, asymptotic properties of learning usually do not depend on the priors (cf. [...]), and priorless theories can be considered [...]. In such theories it would take well over $2k_c$ samples to even start to close down on the actual value of the parameters, and yet a lot more to get accurate results. However, using the wrong continuous parameterization [$\phi(x)$] we were able to obtain good fits for as low as 1000 samples [cf. Fig. (1)] with the help of the prior Eq. (3). Moreover, learning happened continuously and monotonically without huge chaotic jumps of overfitting that necessarily accompany any brute force parameter estimation method at low $N$. So, for some cases, a seemingly more complex model is actually easier to learn!

FIG. 6. Comparison of learning speed for the same data sets with different a priori assumptions.

Thus our claim: when data are scarce and the parameters are abundant, one gains even by using the regularizing powers of wrong priors. The priors select some large scale features that are the most important to learn first and fill in the details as more data become available (see [...] on the relation of this to the Structural Risk Minimization theory). If the global features are dominant (arguably, this is generic), one actually wins in the learning speed [cf. Figs. (2, 3, 6)]. If, however, small scale details are as important, then one at least is guaranteed to avoid overfitting [cf. Fig. (4)].

One can summarize this in an Occam-like fashion [...]: if two models provide equally good fits to data, a simpler one should always be used. In particular, the predictive information, which quantifies complexity [...], and of which $\Lambda$ is the derivative, in a QFT model is $\sim N^{1/2\eta}$, and it is $\sim k_c \log N$ in the parametric case. So, for $k_c > N^{1/2\eta}$, one should prefer a "wrong" QFT formulation to the correct one. These results are very much in the spirit of our whole program: not only is the value of $\ell^*$ selected that simplifies the description of the data, but the continuous parameterization itself serves the same purpose.
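
To get a feeling for where this crossover sits, one can compare the two complexity estimates directly. The snippet below is a rough sketch: both expressions are known only up to constant prefactors, which are set to one here as an assumption, and the function names are hypothetical.

```python
import math

def predictive_info_qft(N, eta):
    """Order-of-magnitude predictive information of the QFT model, ~ N**(1/(2*eta))."""
    return N ** (1.0 / (2.0 * eta))

def predictive_info_parametric(N, k_c):
    """Order-of-magnitude predictive information of a k_c-mode parametric model."""
    return k_c * math.log(N)

def simpler_description(N, eta, k_c):
    """Return which description is cheaper at sample size N (prefactors ignored)."""
    if predictive_info_qft(N, eta) < predictive_info_parametric(N, k_c):
        return "qft"
    return "parametric"
```

With $k_c = 1000$ and $\eta = 1$, for instance, the continuous description is the cheaper one at $N = 1000$ in this crude comparison, in line with the good fits obtained at that sample size.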



VIII. SUMMARY 

The field theoretic approach to density estimation not only regularizes the learning process but also allows the self-consistent selection of smoothness criteria through an infinite dimensional version of the Occam factors. We have shown numerically, and then explained analytically, that this works even more clearly than was conjectured: for $\eta_a < 1.5$, $\Lambda$ truly becomes a property of the data, and not of the Bayesian prior! If we can extend these results to other $\eta_a$ and combine this work with the reparameterization invariant formulation of [...], this should give a complete theory of Bayesian learning for one dimensional distributions, and this theory has no arbitrary parameters. In addition, if this theory properly treats the limit $\eta_a \to \infty$, we should be able to see how the well-studied finite dimensional Occam factors and the MDL principle arise from a more general nonparametric formulation.